NSF PAR Search | NSF Public Access Repository

FluidFaaS: A Dynamic Pipelined Solution for Serverless Computing with Strong Isolation-based GPU Sharing

Hui, Xinning; Xu, Yuanchao; Shen, Xipeng (July 2025, The ACM 33rd International Symposium on High-Performance Parallel and Distributed Computing)

Full Text Available
Generalizing Reuse Patterns for Efficient DNN on Microcontrollers

https://doi.org/10.1145/3676641.3716257

Liu, Jiesong; Ren, Bin; Shen, Xipeng (March 2025, ACM)

Full Text Available
Exploring Function Granularity for Serverless Machine Learning Application with GPU Sharing

https://doi.org/10.1145/3711699

Hui, Xinning; Xu, Yuanchao; Shen, Xipeng (March 2025, Proceedings of the ACM on Measurement and Analysis of Computing Systems)

Recent years have witnessed increasing interest in machine learning (ML) inference on serverless computing due to its auto-scaling and cost-effective properties. However, one critical aspect, function granularity, has been largely overlooked, limiting the potential of serverless ML. This paper explores the impact of function granularity on serverless ML, revealing its important effects on the SLO hit rates and resource costs of serverless applications. It further proposes adaptive granularity as an approach to addressing the phenomenon that no single granularity fits all applications and situations. It explores three predictive models and presents programming tools and runtime extensions to facilitate the integration of adaptive granularity into existing serverless platforms. Experiments show that adaptive granularity produces up to a 29.2% improvement in SLO hit rates and up to a 24.6% reduction in resource costs over state-of-the-art serverless ML systems, which use fixed granularity.
Full Text Available
Reductive Analysis with Compiler-Guided Large Language Models for Input-Centric Code Optimizations

https://doi.org/10.1145/3729282

Wang, Xiangwei; Hui, Xinning; Liao, Chunhua; Shen, Xipeng (June 2025, Proceedings of the ACM on Programming Languages)

Input-centric program optimization aims to optimize code by considering the relations between program inputs and program behaviors. Despite its promise, a long-standing barrier to its adoption is the difficulty of automatically identifying critical features of complex inputs. This paper introduces a novel technique, reductive analysis through compiler-guided Large Language Models (LLMs), to solve the problem through a synergy between compilers and LLMs. It uses a reductive approach to overcome the scalability and other limitations of LLMs in program code analysis. The solution, for the first time, automates the identification of critical input features without heavy instrumentation or profiling. It cuts the time needed for input identification by 44× with remote LLMs (or 450× with local LLMs), from 9.6 hours to 13 minutes (remote LLMs) or 77 seconds (local LLMs) on average, making it feasible to integrate input characterization into the program compilation workflow. Optimizations based on the identified input features show results similar to or even better than those based on features identified by previous profiling-based methods, yielding 92.6% accuracy in selecting the appropriate adaptive OpenMP parallelization decisions, and a 20–30% performance improvement for serverless computing while reducing resource usage by 50–60%.
Full Text Available
Mobile-3DCNN: An Acceleration Framework for Ultra-Real-Time Execution of Large 3D CNNs on Mobile Devices

https://doi.org/10.1145/3747842

Niu, Wei; Sun, Mengshu; Li, Zhengang; Chen, Jou-An; Guan, Jiexiong; Shen, Xipeng; Liu, Jun; Zhang, Mei; Wang, Yanzhi; Lin, Xue; et al. (July 2025, ACM Transactions on Architecture and Code Optimization)

It is challenging to deploy 3D Convolutional Neural Networks (3D CNNs) on mobile devices, especially if both real-time execution and high inference accuracy are in demand, because the increasingly large model size and complex model structure of 3D CNNs usually require tremendous computation and memory resources. Weight pruning has been proposed to mitigate this challenge. However, existing pruning is either incompatible with modern parallel architectures, resulting in long inference latency, or subject to significant accuracy degradation. This paper proposes an end-to-end 3D CNN acceleration framework based on pruning/compilation co-design called Mobile-3DCNN that consists of two parts: a novel, fine-grained structured pruning enhanced by a prune/Winograd adaptive selection (that is mobile-hardware-friendly and can achieve high pruning accuracy), and a set of compiler optimization and code generation techniques enabled by our pruning (to fully transform the pruning benefit into real performance gains). The evaluation demonstrates that Mobile-3DCNN outperforms state-of-the-art end-to-end DNN acceleration frameworks that support 3D CNN execution on mobile devices (Alibaba Mobile Neural Networks and PyTorch-Mobile), with speedups of up to 34× and minor accuracy degradation, proving it is possible to execute high-accuracy large 3D CNNs on mobile devices in real time (or even ultra-real-time).
Full Text Available
ESG: Pipeline-Conscious Efficient Scheduling of DNN Workflows on Serverless Platforms with Shareable GPUs

https://doi.org/10.1145/3625549.3658657

Hui, Xinning; Xu, Yuanchao; Guo, Zhishan; Shen, Xipeng (June 2024, ACM)

Full Text Available
Data Enclave: A Data-Centric Trusted Execution Environment

https://doi.org/10.1109/HPCA57654.2024.00026

Xu, Yuanchao; Pangia, James; Ye, Chencheng; Solihin, Yan; Shen, Xipeng (March 2024, IEEE)

Full Text Available
SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile

https://doi.org/10.1145/3620666.3651384

Niu, Wei; Sanim, Md_Musfiqur Rahman; Shu, Zhihao; Guan, Jiexiong; Shen, Xipeng; Yin, Miao; Agrawal, Gagan; Ren, Bin (April 2024, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Full Text Available
Survey: Exploiting Data Redundancy for Optimization of Deep Learning

https://doi.org/10.1145/3564663

Chen, Jou-An; Niu, Wei; Ren, Bin; Wang, Yanzhi; Shen, Xipeng (October 2023, ACM Computing Surveys)

Data redundancy is ubiquitous in the inputs and intermediate results of Deep Neural Networks (DNNs). It offers many significant opportunities for improving DNN performance and efficiency and has been explored in a large body of work. These studies are scattered across many venues and several years. The targets they focus on range from images to videos and texts, and the techniques they use to detect and exploit data redundancy also vary in many aspects. There is not yet a systematic examination and summary of the many efforts, making it difficult for researchers to get a comprehensive view of the prior work, the state of the art, differences and shared principles, and the areas and directions yet to explore. This article tries to fill the void. It surveys hundreds of recent papers on the topic, introduces a novel taxonomy to put the various techniques into a single categorization framework, offers a comprehensive description of the main methods used for exploiting data redundancy in improving multiple kinds of DNNs on data, and points out a set of research opportunities for future exploration.
Full Text Available
BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUs

https://doi.org/10.1145/3577193.3593725

Chen, Jou-An; Sung, Hsin-Hsuan; Shen, Xipeng; Choudhury, Sutanay; Li, Ang (June 2023, ACM)

Full Text Available
